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Abstract 

We present an analytical method for determining the designability of protein 
structures. We apply our method to the case of two-dimensional lattice struc- 
tures, and give a systematic solution for the spectrum of any structure. Using 
this spectrum, the designability of a structure can be estimated. We outline 
a heirarchy of structures, from most to least designable, and show that this 
heirarchy depends on the potential that is used. 
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The protein design problem asks which and how many amino-acid sequences fold into 
a given protein structure. This problem, having obvious practical and evolutionary signifi- 
cance, attracted considerable attention and effort of experimentalists |Q and theorists 
In particular. Tang and coworkers carried out a thorough study of the design problem for 
a simple square and cubic lattice model of proteins These authors enumerated all se- 
quences made of two types of monomers and all conformations for 2-dimensional 36-mers 
and 3-dimensional 27-mers. This calculation provided a density of states in sequence space 
for the studied models, i.e. how many sequences have a given energy in a given conformation. 
For each conformation, the "designability," which is defined as the number of sequences that 
have this conformation as a ground state, was determined in 0. Interestingly, it was found 
that designability varied from conformation to conformation. Furthermore, the structures 
that possess certain degrees of symmetry turned out to be most designable 0. This result 
tempts one to speculate that the factor of designability may be the cause of symmetries 
observed in real proteins. 

The physical reason for the observed relationship between geometric properties of confor- 
mations and their designability remained unclear. In particular, it is important to evaluate 
how robust is this relationship with respect to choice of lattice model, model protein "al- 
phabet," and interaction parameters between model aminoacids. Recent numeric studies @ 
and showed that the set of the most designable structures does indeed depend on the 
potential and aminoacid alphabet used. 

A deeper understanding of the designability problem requires an analytical solution that 
may shed light on the general geometric features that make structures designable. Such a 
solution will facilitate an informed connection between model results and real proteins. 

In this Letter we present a statistical mechanical analysis of designability, and give an 
analytical solution to the problem on a two-dimensional lattice. Our analysis is based on the 
analogy between protein design and a class of statistical-mechanical spin models established 
in earlier works 0,^. 

We use the standard lattice model to represent protein conformations. On a lattice, a 
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structure is defined as a self-avoiding walk of length N. The energy of a sequence of residues 
{cTj} in a structure with coordinates of the monomers {rj} is 

N 

Hi{a,},{r,}) = J2Uia.,aj)Ain,r,) (1) 

i<j 

A{ri,rj) is defined to be 1 when and rj are nearest-neighbors on the lattice and not se- 
quence neighbors along the chain. For real proteins the definition of a contact requires a 
choice of a cut-off distance. The folding potential is given by a symmetric matrix f/(a,/3) 
whose entries are the pairwise interaction energy between aminoacid residues of types a and 
(3. 

In any instance of the design problem, the structure {r°} is given (we will refer to this 
structure as the "target conformation" for design), and we are asked to find all sequences 
whose native (folded) state is {r°}. The number J\f of such sequences is the designability of 
the structure. Since the residues of a protein are tightly packed in space, only maximally 
compact lattice structures are generally studied. On a square lattice these correspond to 
self-avoiding walks that completely fill a square. The folding potential H is therefore as- 
sumed to have a non-specific attractive term (which we have omitted), that favors compact 
conformations. 

In order for a sequence to fold and to be stable in the desired target conformation it 
must have lowest energy in this conformation compared to its energy in all alternative con- 
formations. The spectrum of energies of alternative conformations is qualitatively similar to 
that of a random heteropolymer, i.e. it consists of a continuum part, with a high density of 
states whose lowest energy is Ec, and of a few discrete energy levels lying below Ec P,p!0|. 
Importantly, the value of Ec is self-averaging H, i.e it depends on amionacid composition 



rather than on a particular sequence |T3|,|^. While this description is strictly applicable to 
three-dimesional heteropolymers (see though it was shown in |12[ that it qualitatively 
applies to two-dimenisonal case. 

Thus the issue of designability is reduced to the question of how many sequences with 
a given aminoacid composition can fold into a given conformation with an energy E <^ E^. 



Each structure has a sequence-space energy spectrum given by the energy density 



(2) 



The above expression corresponds to a system of spins {(Xj} confined to an interaction 
geometry dictated by A(rj,rj), with CTj G 1 . . .q, where q is the number of possible residue 
types. In this language the aminoacid composition of a sequence is equivalent to total 
magnetization and is denoted by the letter C, which is in general a g-dimensional vector. 
The probability p{E, C) that a sequence {a^} of composition C, will fold into a structure A, 



where Tc is the temperature of the thermodynamic "freezing" transition in random het- 
eropolymers having composition C, and 7 is the effective number of conformations per 
monomer [^f. These properties of Ec render it a valuable link between sequence and struc- 
ture space. If we know the spectrum n{E, A, C) of a structure A over all sequences of fixed 
composition C, we can compute its expected designability as follows: 



On a two-dimensional square lattice, maximally compact structures have a property that 
makes the spectrum calculation n{E, A, C) analytically tractable. If the ends of the chain 
are prescribed to be neutral - they do not interact with any monomers - then the entire 
structure can be decomposed into a direct product of one-dimensional loops and strands. A 
loop is a collection of 5* monomers in which monomer i is in contact with monomer i + for 
1 < i < 5 — 1, and monomer 1 is in contact with monomer 5*; if monomer 1 and monomer 
5* are not in contact, we call the collection a strand. Figure 1 shows the decomposition 
of a particular 36-mer into strands and loops (ends have been allowed to interact in this 
figure). This decomposition derives from the restriction that nearest neighbors along the Fig 
chain are not in contact, leaving each monomer in the interior of a structure exactly 2 



where E = H{{ai},A), is g 
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contacts to make. Since this excludes branched systems of contacts, the only possibilities 
left are loops and strands. Monomers lying on the square border of a structure can make 
only one contact because the chain must occupy two of the three neighboring sites. Strands, 
therefore, must begin and end on the border of a structure, and loops must lie entirely 
within it. Since the corners of a structure cannot interact, a square structure of length 

has A^/N — 8 interacting border sites. Thus, there must be — 4 strands in each 

structure. Including the ends of the chain would add at most 2 branched systems, or would 
change the number of strands by at most 2. While the number of strands is largely the 
same for all structures, the distribution of lengths of strands is structure-dependent, and 
will prove to be the determinant of designability. 

We restrict this discussion to systems with two types of aminoacids, which we denote 
+ 1 and -1, although the general methodology can certainly be applied to larger alphabets. 
Following the analogy between protein design and spin models [0,0, aminoacid types will 
be called "spins" in what follows. The interaction energies are taken to be f/(l, 1) = a, 
f/(— 1,— 1) = b, and f/(l,— 1) = f/(— 1,1) = d. The general aim is to compute n{E, A,C), 
which can be deduced from the partition function at composition C, over all sequences {cTj}: 

^(c) = Ee-^''^"^5(c-E^0 (5) 

Note that composition in the 2-spin case is a scalar corresponding to the net magnetization. 
The delta function can be expressed as a Fourier integral, and if we define the parition 
function for the j-th loop or strand to be 

Zj{iu) = J2 e~^^^"»>-*"2-« (6) 

where is the set of all spins in the j-th loop or strand, we can write 

Z{C) = J e'^'^duflZjiuj) (7) 

where s and / are the number of strands and loops, respectively. Zj{u) can be computed by 
the transfer matrix method; we define the transfer matrix T as 
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T 



Ae-'"^ D 
D Be''^ 



J 



(8) 



where A = e i? = e and D = e If A+ and A_ are the eigenvalues of T, we have 
for a loop, Zj{uj) = \^ + X"L\ while for a strand 



E 



'((Tl+(T,i . )iuj/2 



(9) 



where rij is the length of the j-th loop or strand. The expression for strands is more 
complicated than that for loops because the first and last monomers in a strand have to be 
summed over separately. By diagonalizing T, the following expression for strands can be 
derived: 



\2D - A- B){Xl' ^-Xl' V 



(e*" + e-*")(A+' - XI') /(A+ - A_) 



(10) 



The eigenvalues can be calculated, and one observes that if = AB, we get A+ = Ae~'^^ + 
Be^"^ , while A_ = 0. This situation corresponds to any potential for which d = {a + h)/2. 
We will call this the symmetric potential, because any such potential is the same as a = 1, 
h = —1, d = 0. Under the symmetric potential, Zj^u) = A"^ for loops, and Zj{uj) for strands 
simplifies considerably to 



Zj{uj) = X'^~\VAe'''^ + y/Be 



(11) 



If is the chain length of a structure, the total partition function (^) for the symmetric 
potential now becomes 

(12) 



Z{C) = J e^'^^duX^-^'iVAe-''^ + VBe'^^f 



Using the binomial expansion, the integral can be calculated exactly and the density of 
states, n{E,C) obtained. The following is a plot of n{E,C) using a = —1, 6 = 1, N = 36, 
and C = 0: Fig.2 
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Under the symmetric potential, a structure's spectrum depends only on its length, com- 
position, and the number of strands. Since these three are the same for each structure, 
the spectra of all structures are identical (up to end-effects). This result implies that all 
structures have the same expected designability, with fluctuations only due to the effects of 
interacting ends and to any sequence-dependent terms of Ec- For large N these terms will 
be insignificant. 

The symmetric potential is a singular case in that structural features play no role in 
designability. It has previously been identified in 0] as an important special case for design. 
We now show how designability becomes sensitive to structure under perturbations of this 
potential. 

Under the symmetric potential, one eigenvalue is zero. If we add a small perturbation u 
to the potential, so that a' = a + u, b' = b, and c' = {a + b)/2, we would expect one eigenvalue 
still to be small compared to the other. Under a perturbed potential, then, A_/A+ ^ 1. 
Starting from the general expression for strands (p!OD, we observe that if A_ < A+, we have 
two geometrical series in R = A_/A+, and we can rewrite it as: 



Zj{uj) = A+^^^0 



nj—2 

1 + ^ i?^ - (e*" + e~''^)—R''-^ 
k=i ^ 



(13) 

I k=l J 

where G = Ae~'^^'^ + Be'^^'^ + 2D. The corresponding expression for loops is 

Z,{u;) = \l\l + R-^) (14) 

Using these expressions in equation (|^) for Z{C) results in a product of the loop and strand 
Zj^s. We can expand this product in R, and obtain the following kinds of terms: The 
principal contribution - A^^^'^0'' - (zero-order in R) is a structure- independent term corre- 
sponding to the symmetric potential. Terms coming from a single loop of length rij look like 
^jv-2s0s^nj 'jg2;]2is comiug from a single strand of length n,- are: 
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R^ - (e^^ + e-*")^i?'=-^ 



(15) 



where 1 < k < rij — 2. Strands of length 2 contribute no terms. Thus, strands have terms 
at all powers of R less than rij — 1, while loops only perturb at R^^ . Since the smallest 



possible loop is of length 4, strands of lengths greater than 2 are much more important to 
the structure of the spectrum than loops of any length. 

To calculate a perturbed spectrum, one still has to back-transform the Fourier variable 
oj in each term. Setting U = e~^", the perturbed eigenvalues must satisfy A+ + A_ = 
AUe''^'^ + Be^'^ and A_|_A_ = AB{U — 1). The principal contribution to the partition function 
is 



which can be computed by binomial expansions of A^~^*. The only difficulty in doing so 
comes from odd powers of a square-root term in the expression for A_|_ which would normally 
lead to a power series expansion rather than a finite binomial expansion. The approximation 
we make in our computation is to replace the odd powers of the square root with the next 
even number, so that we have terminating binomial expansions at all powers. To compute 
a perturbation of the form X^~''^'^Q'^R'' we use R = AB{U — 1)/A^ and obtain: 



which is similar to Zprindpai except for another binomial that can be expanded. Computing 
perturbations of any order is therefore relatively straightforward. 

We performed computations to check our theory with a full enumeration of sequences and 
structures that was published by Tang and coauthors [Q. These authors used a potential 
(a = —2.3, 6 = 0, (i = —1) which can be rescaled to the perturbed potential a = —1, 
b = 1, d = 0, u = —0.3. The following is a log plot of the principal (structure-independent) 
contribution to the spectrum of a 36-mer: Fig-3 

This lowest-order approximation of the density of states bears a striking resemblance to 
the spectrum calculated by full enumeration for 3D structures (C.Tang private communi- 
cation). Unfortunately our request to provide sample densities of states for 2D structures 
had not been granted by the authors of Qualitative features such as the shape and 
the anomalous speckeled pattern emerge from our theory. The pattern of the spectrum is 




(16) 




(17) 



8 



essentially a splitting of the states of the symmetric potential due to the perturbation u. 

The expected designability depends on the value of Ec- The following is a plot of the 
change in expected designability due to the first-order strand perturbation as a function of 
Ec- Note that the logarithm of negative numbers was taken in absolute value and plotted 
as a negative value. The change in the expected designability of the structure due to Fig 
this perturbation is negative for a large range of Ec in the lower part of the spectrum. Since 
strands of lengths greater than 2 will contribute a first-order perturbation, the more such 
strands in a structure, the less designable it will be. The most designable structures will, 
therefore, contain the maximum number of strands of length 2. This is the major effect 
determining designability; adding higher-order terms does not change this result. The effect 
which 2-strands have on designability is due to the condition -u < 0. One can also perturb 
with M > and obtain the opposite result: 2-strands are not desirable. All the plots shown 
here correspond to composition C = 0. Similar plots are obtained over a large range of C, 
with significant qualitative differences for \C\ close to A^, i.e. as the sequence approaches a 
homopolymer. Our conclusion remains unchanged by these differences because at extreme 
compositions we expect very few sequences to design any given structure. 

Deciding whether or not a strand of length n is better for designability than a strand 
of length n + 1 requires looking at the change in expected designability due to higher order 
perturbations. We find that up to strands of lengths 5, the shorter the strand, the more 
designable the structure. For longer lengths, differences in designability become very small, 
and resolution of these differences requires precise knowledge of E^. 

We have presented a general, analytical approach to the protein design problem, and 
by applying it to the case of two-dimensional structures, we have shown how it effectively 
recognizes features that make some structures much more designable than others. This al- 
lowed us to identify a heirarchy of designability based on strand lengths. Future work should 
generalize these results to larger alphabets (which leads to a g-th degree polynomial in A), 
and to lattices of dimension three (which leads to highly branched systems of contacts). 

We thank A.Grosberg for useful comments. The work was supported by NSF scholarship 
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FIGURE CAPTIONS 

FIG. 1. Strand/Loop decomposition for a 36-mer 



FIG. 2. Symmetric potential spectrum of a 36-mer 



FIG. 3. Principal contribution to 36-mer spectrum with u = -0.3 



FIG. 4. Change in expected designabihty due to Ist-order perturbation 
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Kassel and Shakhnovich Fig. 2 



Log n [E ] 




Kassel and Shakhnovich Fig. 3 



Log [change m 



designability ] 




Kassel and Shakhnovich Fig 



